Language Model
Learning a good distribution over sentences is called language modeling.
Generally, we assume the sentences $s^{(1)}, \dots, s^{(N)}$ in the corpus are independent, so maximum likelihood estimation maximizes $\prod_{i=1}^{N} p(s^{(i)})$, and the log-likelihood is $\sum_{i=1}^{N} \log p(s^{(i)})$ (its negative is the cross-entropy loss).
Sometimes we look even deeper and treat a sentence as a sequence of words $w_1, \dots, w_T$. Applying the chain rule of conditional probability gives $p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1})$. Again, to simplify the problem, we make the Markov assumption: the distribution over the next word depends only on the preceding few words. Some properties the Markov assumption brings:
- the model is now memoryless
- since the next word depends only on a few preceding words, it becomes a supervised prediction problem
- when we decompose the joint distribution into these separate prediction problems, we call it an autoregressive model
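To make the autoregressive decomposition concrete, here is a minimal scoring sketch under a first-order Markov (bigram) assumption; the probability table below is a made-up toy example, not estimated from data.

```python
import numpy as np

# Hypothetical bigram conditionals p(next | prev); a real model would
# estimate these from a corpus.
bigram_probs = {
    ("<s>", "the"): 0.5, ("the", "cat"): 0.2,
    ("cat", "sat"): 0.3, ("sat", "</s>"): 0.4,
}

def sentence_log_prob(words):
    """log p(w_1..w_T) = sum_t log p(w_t | w_{t-1}) under the Markov assumption."""
    tokens = ["<s>"] + words + ["</s>"]
    return sum(np.log(bigram_probs[(prev, cur)])
               for prev, cur in zip(tokens[:-1], tokens[1:]))

print(sentence_log_prob(["the", "cat", "sat"]))  # sum of per-step log conditionals
```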
n-gram
The simplest way to estimate the probabilities is from the empirical distribution (i.e. count the frequencies of words). If each distribution conditions on the previous $n-1$ words, the resulting model is an $n$-gram language model, and a phrase of $n$ consecutive words is called an n-gram. The problems with this model:
- the number of possible contexts (and hence entries in the conditional probability table) grows exponentially with the context length
- most n-grams never appear in the corpus (data sparsity)
Solutions to data sparsity
- Use a shorter context (which leads to a less accurate model)
- Smooth the probabilities (e.g. add a small constant to the counts); see the counting sketch after this list
- Make predictions using an ensemble of models with different context lengths.
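As a concrete sketch of the counting estimator together with additive smoothing (the toy corpus and the add-$\alpha$ scheme are illustrative assumptions):

```python
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
vocab = sorted(set(w for s in corpus for w in s.split()))

# Count bigrams and their left contexts in the toy corpus.
bigram_counts, context_counts = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    for prev, cur in zip(words[:-1], words[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def smoothed_prob(prev, cur, alpha=1.0):
    """Add-alpha smoothed estimate of p(cur | prev): unseen bigrams
    get a small but nonzero probability."""
    return ((bigram_counts[(prev, cur)] + alpha) /
            (context_counts[prev] + alpha * len(vocab)))

print(smoothed_prob("the", "cat"))  # seen bigram
print(smoothed_prob("cat", "dog"))  # unseen, but nonzero thanks to smoothing
```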
Distributed Representations
Conditional probability tables (of which n-gram models are one kind) are a localist representation: all the information about a particular word is stored in one place. Since different words can be related to each other, we would like to share information between them.
When, instead, the information about a given word is distributed throughout the representation, we call it a distributed representation. Such a representation lets us share information between words.
The most common such model is the neural language model, which takes the previous words as input and outputs a distribution over the next word. The cross-entropy loss here uses the model's predicted probability of the next word, rather than empirical word frequencies as in an n-gram model.
- We can extend the cross-entropy to a whole sequence, $\mathcal{L} = -\sum_{t} \log y_{t, w_t}$, where $y_{t,v}$ is the predicted probability of the $t$-th word being index $v$.
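A small sketch of this loss; the predicted probabilities and target indices below are made up for illustration.

```python
import numpy as np

# Hypothetical predictions: y[t, v] = model's probability that word t is index v.
y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
targets = np.array([0, 1, 2])  # observed word indices at each position

# Cross-entropy: -sum_t log y[t, w_t], i.e. the one-hot form
# -sum_t sum_v t_{t,v} log y_{t,v}.
loss = -np.sum(np.log(y[np.arange(len(targets)), targets]))
print(loss)
```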
More about neural language models
Suppose we use a 1-of-K encoding for the words; then the first layer is a linear layer with tied weights, which essentially acts as a lookup table. Each column of its weight matrix is the representation of a word (called an embedding, feature vector, or encoding, with different emphasis).
- Embedding emphasizes that it's a location in a high-dimensional space where words closer to each other are more semantically similar.
- Feature vector emphasizes that it's a vector of numbers that can be used as input to a machine learning algorithm.
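A minimal numpy sketch of why a linear layer on 1-of-K inputs is just a lookup table: multiplying the weight matrix by a one-hot vector selects the corresponding column (the vocabulary size, dimension, and random weights are assumptions for illustration).

```python
import numpy as np

V, D = 5, 3                      # assumed vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(D, V))      # each column is a word's embedding

word_index = 2
one_hot = np.zeros(V)
one_hot[word_index] = 1.0

via_matmul = W @ one_hot         # the "linear layer" applied to a 1-of-K input...
via_lookup = W[:, word_index]    # ...is exactly a column lookup
assert np.allclose(via_matmul, via_lookup)
```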
Problems
In high-dimensional embeddings:
- Most vectors are nearly orthogonal to each other.
- Most points are far away from each other.
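These facts are easy to check empirically; here is a small sketch using random Gaussian vectors as a stand-in for learned embeddings (an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 1000
x, y = rng.normal(size=dim), rng.normal(size=dim)

# Cosine similarity of two random high-dimensional vectors concentrates
# near 0 (roughly O(1/sqrt(dim))), i.e. they are nearly orthogonal.
cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
# ...and the distance between two random points is large relative to their norms.
dist = np.linalg.norm(x - y)
print(cos, dist)
```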
Lower-dimensional embeddings:
- 2-D embeddings might be misleading, since they cannot preserve all of the distance relationships
Most of the time, fitting a language model well is hard, and it may be overkill if all we want is the embeddings.
GloVe
GloVe (Global Vectors) embedding is a simpler and faster approach based on a matrix factorization similar to principal component analysis (PCA). Following the distributional hypothesis (words with similar distributions have similar meanings), we consider a co-occurrence matrix $X$ whose entry $x_{ij}$ counts the number of times words $i$ and $j$ appear nearby. We then take a rank-$K$ approximation $X \approx R\tilde{R}^\top$, where $R$ and $\tilde{R}$ are $V \times K$ matrices.
- each row $r_i$ of $R$ is the $K$-dimensional representation of a word.
- each entry $x_{ij} \approx r_i^\top \tilde{r}_j$
- more similar words are more likely to co-occur
- minimize the squared Frobenius norm $\|X - R\tilde{R}^\top\|_F^2$ (as in PCA). This leads to the GloVe embedding cost function $J(R, \tilde{R}) = \sum_{i,j} f(x_{ij})\big(r_i^\top \tilde{r}_j + b_i + \tilde{b}_j - \log x_{ij}\big)^2$, where the weighting $f$ (with $f(0) = 0$) reweights the entries so that only nonzero counts matter, which keeps the factorization feasible even though $X$ is extremely large; the $\log$ reduces the effect of the most common words.
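A minimal sketch of fitting this cost with plain gradient descent on a tiny synthetic co-occurrence matrix (the matrix, dimensions, weighting constants, and learning rate are all illustrative assumptions, not the reference GloVe implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 6, 2                              # toy vocabulary size and embedding dimension
X = rng.poisson(3.0, size=(V, V)) + 1    # synthetic nonzero co-occurrence counts

def f(x, x_max=10.0, alpha=0.75):
    """GloVe-style weighting: downweights rare pairs, caps frequent ones, f(0) = 0."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0) * (x > 0)

R, Rt = 0.1 * rng.normal(size=(V, K)), 0.1 * rng.normal(size=(V, K))  # word / context vectors
b, bt = np.zeros(V), np.zeros(V)                                      # word / context biases

W, logX, lr = f(X), np.log(X), 0.05
for step in range(500):
    # Residual of J = sum_ij f(x_ij) (r_i . r~_j + b_i + b~_j - log x_ij)^2
    E = R @ Rt.T + b[:, None] + bt[None, :] - logX
    G = W * E                                          # elementwise reweighting
    grad_R, grad_Rt = G @ Rt, G.T @ R                  # gradients (factor of 2 folded into lr)
    grad_b, grad_bt = G.sum(axis=1), G.sum(axis=0)
    R, Rt = R - lr * grad_R, Rt - lr * grad_Rt
    b, bt = b - lr * grad_b, bt - lr * grad_bt

E = R @ Rt.T + b[:, None] + bt[None, :] - logX
print("final weighted cost:", np.sum(W * E ** 2))
```

The rows of $R$ (and $\tilde{R}$) then serve as the word embeddings; in practice the word and context vectors are often summed.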